Cost functions for pairwise data clustering 
Cost functions for non-hierarchical pairwise clustering are
introduced, in the probabilistic autoencoder framework, by
requiring maximal average similarity between the input and
the output of the autoencoder. The partition provided by these
cost functions identifies clusters with dense connected regions
in data space; differences and similarities with respect to a
well-known cost function for pairwise clustering are outlined.

Clustering methods aim at partitioning a set of data points
into classes such that points that belong to the same class
are more alike than points that belong to different classes [?].
These classes are called clusters, and their number may be
preassigned or may be a parameter to be determined by the
algorithm. Clustering is applied in fields as diverse as pattern
recognition [?], astrophysics [?], communications [?], biology,
business [?] and many others. Two main approaches to clustering
can be identified: parametric and non-parametric clustering.

Non-parametric approaches make few assumptions about the data
structure and typically follow some local criterion for the
construction of clusters. Typical examples of the non-parametric
approach are the agglomerative and divisive algorithms that
produce dendrograms. In recent years, non-parametric clustering
algorithms have been introduced that employ the statistical
properties of physical systems. The superparamagnetic approach
by Domany and coworkers [?] exploits the analogy to a model
granular magnet: the spin-spin correlation of a Potts model,
living on the data-point lattice and with pair couplings
decreasing with the distance, is used to partition points into
clusters. The synchronization properties of a system of coupled
chaotic maps are used in [?] to produce hierarchical clustering.

Parametric methods make some assumptions about the underlying
data structure. Generative mixture models [?] treat clustering
as a problem of density estimation: data are viewed as coming
from a mixture of probability distributions, each representing
a different cluster, and the parameters of these distributions
are adjusted to achieve a good match with the distribution of
the input data. This can be obtained by maximizing the data
likelihood (ML), or the posterior (MAP) if additional prior
information on the parameters is available [10].



Many parametric clustering methods are based on a cost function:
the best partition of points into clusters is assumed to be the
one with minimum cost. Cost functions often incorporate the loss
of information incurred by the clustering procedure when trying
to reconstruct the original data from the compressed cluster
representation; the most popular algorithm to optimize a cost
function is K-means [?]. Starting from a statistical ansatz and
invoking maximum likelihood leads to a cost function which has
been observed to work for clustering financial time series [?].

It is important to stress the difference between central
clustering, where it is assumed that each cluster can be
represented by a prototype [?], and pairwise clustering, where
data are indirectly characterized by pairwise comparison instead
of explicit coordinates [?]; pairwise algorithms require as input
only the matrix of dissimilarities. The choice of the measure of
dissimilarity is not unique, and it is crucial for the
performance of any pairwise clustering method. It is worth
remarking that the dissimilarity matrix often violates the
requirements of a distance measure, i.e. the triangle inequality
does not necessarily hold.

Folded Markov chains are used in the Probabilistic Autoencoder
Framework to derive cost functions for clustering [?]. Some
examples of two-stage folded Markov chains, and the corresponding
algorithms for clustering and topographic mapping [?], are
thoroughly analyzed in [?], where it is also shown that the cost
function for pairwise clustering introduced in [?] may be seen as
a consequence of Bayes' theorem and the requirement of minimal
average distortion in a probabilistic autoencoder.

It is the purpose of this work to introduce a new class of cost
functions for pairwise clustering, which can be obtained, in the
autoencoder framework, by requiring maximal similarity instead of
minimal distortion. We show that the cost functions introduced
here provide a non-hierarchical clustering of points in which
dense connected regions of points in the data space are
recognized as clusters.

Let us now discuss autoencoders described by one-stage folded
Markov chains. Consider a point $x$ in a data space, sampled with
probability distribution $P_0(x)$; a code index
$\alpha \in \{1, \ldots, q\}$ is assigned to $x$ according to the
conditional probabilities $P(\alpha|x)$. A reconstructed version
of the input, $x'$, is then obtained by use of the Bayesian
decoder:

$$ P(x'|\alpha) = \frac{P(\alpha|x')\,P_0(x')}{P(\alpha)} , \qquad (1) $$

where $P(\alpha) = \int dx\, P_0(x)\,P(\alpha|x)$ is the marginal
probability of the code index. The joint distribution of $x$,
$x'$ and $\alpha$, describing this encoding-decoding process, is

$$ P(x, x', \alpha) = P_0(x)\,P(\alpha|x)\,P(x'|\alpha) ; \qquad (2) $$

owing to (1), the joint distribution reads:

$$ P(x, x', \alpha) = \frac{P_0(x)\,P_0(x')\,P(\alpha|x)\,P(\alpha|x')}{P(\alpha)} . \qquad (3) $$

The conditional probabilities $\{P(\alpha|x)\}$ are the free
parameters that must be adjusted to force the autoencoder to
emulate the identity map on the data space.
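
As an illustration, Eqs. (1)-(3) can be evaluated directly when the data space is a finite set of points. The following Python sketch assumes a discrete prior `p0` and an encoder given as a list of rows $P(\alpha|x_i)$; the function names and the four-point example are illustrative choices, not part of the original formulation, and code indexes are counted from 0.

```python
def code_marginal(p0, encoder):
    """P(alpha) = sum_x P0(x) P(alpha|x) on a finite data space."""
    q = len(encoder[0])
    return [sum(p0[i] * encoder[i][a] for i in range(len(p0))) for a in range(q)]


def bayesian_decoder(p0, encoder):
    """Eq. (1): decoder[a][j] = P(alpha=a|x_j) P0(x_j) / P(alpha=a)."""
    p_alpha = code_marginal(p0, encoder)
    n = len(p0)
    return [[encoder[j][a] * p0[j] / p_alpha[a] for j in range(n)]
            for a in range(len(p_alpha))]


def joint_distribution(p0, encoder):
    """Eq. (2): joint[i][j][a] = P0(x_i) P(a|x_i) P(x_j|a)."""
    decoder = bayesian_decoder(p0, encoder)
    n, q = len(p0), len(decoder)
    return [[[p0[i] * encoder[i][a] * decoder[a][j] for a in range(q)]
             for j in range(n)] for i in range(n)]


# Hypothetical example: four equiprobable points, two code indexes.
p0 = [0.25, 0.25, 0.25, 0.25]
encoder = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
joint = joint_distribution(p0, encoder)
total = sum(joint[i][j][a] for i in range(4) for j in range(4) for a in range(2))
print(total)  # the joint distribution is normalized, up to rounding
```

The joint distribution built this way is symmetric under the exchange of $x$ and $x'$, as Eq. (3) makes explicit. The sketch assumes every code index has non-zero probability.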

Let $d(x, x')$ be a measure of the distortion between input and
output of the autoencoder. The average distortion is then given
by:

$$ \mathcal{D} = \sum_{\alpha=1}^{q} \int dx \int dx'\, \frac{P_0(x)\,P_0(x')\,P(\alpha|x)\,P(\alpha|x')}{P(\alpha)}\, d(x, x') . \qquad (4) $$

Moreover, let $s(x, x')$ be a measure of the similarity between
input and output; the average similarity is then given by

$$ \mathcal{S} = \sum_{\alpha=1}^{q} \int dx \int dx'\, \frac{P_0(x)\,P_0(x')\,P(\alpha|x)\,P(\alpha|x')}{P(\alpha)}\, s(x, x') . \qquad (5) $$

It is natural to postulate a one-to-one mapping between values of
distortion and similarity, $s = F(d)$, with $F$ a strictly
decreasing function. A good autoencoder is characterized by a low
value of $\mathcal{D}$ and a high value of $\mathcal{S}$. However,
we remark that the two requirements $\mathrm{Min}(\mathcal{D})$
and $\mathrm{Max}(\mathcal{S})$, for reasonable choices of $F$,
are not generically equivalent.
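
On a finite data space the integrals in Eqs. (4) and (5) become double sums, which can be evaluated as in the following Python sketch. The helper name `autoencoder_average`, the three-point example and the choice $F(d) = e^{-d}$ are assumptions made only for illustration.

```python
import math


def autoencoder_average(p0, encoder, pair_value):
    """Sum over alpha, x, x' of P0(x)P0(x')P(a|x)P(a|x') v(x,x') / P(alpha)."""
    n, q = len(p0), len(encoder[0])
    total = 0.0
    for a in range(q):
        weights = [p0[i] * encoder[i][a] for i in range(n)]  # P0(x) P(alpha|x)
        p_alpha = sum(weights)                               # marginal P(alpha)
        if p_alpha == 0.0:
            continue  # an unused code index contributes nothing
        pair_sum = sum(weights[i] * weights[j] * pair_value[i][j]
                       for i in range(n) for j in range(n))
        total += pair_sum / p_alpha
    return total


# Hypothetical example: three points on a line, two code indexes.
points = [0.0, 1.0, 3.0]
d = [[abs(u - v) for v in points] for u in points]
s = [[math.exp(-dij) for dij in row] for row in d]  # assumed F(d) = exp(-d)
p0 = [1.0 / 3.0] * 3
encoder = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # deterministic assignment

avg_distortion = autoencoder_average(p0, encoder, d)  # Eq. (4)
avg_similarity = autoencoder_average(p0, encoder, s)  # Eq. (5)
print(avg_distortion, avg_similarity)
```

The same routine serves both averages because Eqs. (4) and (5) differ only in the pairwise quantity being weighted.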

Now we turn back to the clustering problem. Given a data set
$\{x_i\}$ of cardinality $N$, partitioning these points into $q$
classes corresponds, in this framework, to designing an
autoencoder, with $q$ code indexes, acting on the data space. We
choose the encoder to be deterministic:

$$ P(\alpha|x) = \delta_{\alpha\,\sigma(x)} , \qquad (6) $$

$\sigma(x) \in \{1, \ldots, q\}$ being the code index associated
to $x$. The estimate for the average distortion (4), based on the
data set at hand, is given by
$\mathcal{D} = \frac{1}{N} H_d[\sigma]$, where we introduce the
Hamiltonian $H_d$ for the Potts variables $\{\sigma_i\}$:

$$ H_d[\sigma] = \sum_{\alpha=1}^{q} \frac{\sum_{i,j=1}^{N} \delta_{\alpha\sigma_i}\,\delta_{\alpha\sigma_j}\, d_{ij}}{\sum_{k=1}^{N} \delta_{\alpha\sigma_k}} , \qquad (7) $$

where $\sigma_i = \sigma(x_i)$ and $d_{ij} = d(x_i, x_j)$. It
turns out that $H_d$ is equivalent to the cost function for
pairwise clustering, influential in the clustering literature,
introduced in [?].
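
The cost $H_d$ of a given partition can be computed from the dissimilarity matrix alone, as in this minimal Python sketch. The function names and the one-dimensional example are illustrative assumptions; labels run from 0 to $q-1$ rather than from 1 to $q$, and empty clusters are taken to contribute zero.

```python
def pairwise_cost(labels, values, q):
    """Sum over clusters of (sum of values[i][j] within the cluster) / size."""
    total = 0.0
    for alpha in range(q):
        members = [i for i, lab in enumerate(labels) if lab == alpha]
        if not members:
            continue  # empty cluster: treated as contributing nothing
        within = sum(values[i][j] for i in members for j in members)
        total += within / len(members)
    return total


def distortion_hamiltonian(labels, d, q):
    """H_d of Eq. (7) for a dissimilarity matrix d."""
    return pairwise_cost(labels, d, q)


# Hypothetical example: two well-separated pairs of points on a line.
points = [0.0, 1.0, 10.0, 11.0]
d = [[abs(u - v) for v in points] for u in points]

good = distortion_hamiltonian([0, 0, 1, 1], d, 2)  # neighbours grouped together
bad = distortion_hamiltonian([0, 1, 0, 1], d, 2)   # distant points grouped together
print(good, bad)  # the natural partition has the lower cost
```

Only the matrix `d` enters the computation, which reflects the pairwise nature of the cost function: no coordinates or prototypes are needed.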

The estimate for the average similarity is, similarly, given by
$\mathcal{S} = -\frac{1}{N} H_s[\sigma]$, where we introduce the
Hamiltonian $H_s$:

$$ H_s[\sigma] = -\sum_{\alpha=1}^{q} \frac{\sum_{i,j=1}^{N} \delta_{\alpha\sigma_i}\,\delta_{\alpha\sigma_j}\, s_{ij}}{\sum_{k=1}^{N} \delta_{\alpha\sigma_k}} , \qquad (8) $$

with $s_{ij} = s(x_i, x_j)$.

If we choose the autoencoder by minimizing the average
distortion, then the best partition of the data set into $q$
classes corresponds to the ground state of $H_d$. If we choose it
by maximizing the average similarity, then the ground state of
$H_s$ must be sought instead. Since both $\{d_{ij}\}$ and
$\{s_{ij}\}$ may be taken positive, it follows that $H_d$ is
characterized by antiferromagnetic couplings between the Potts
variables, while $H_s$ is made of ferromagnetic couplings. The
denominators in both $H_d$ and $H_s$ serve to enforce the
coherence among the $q$ clusters. In particular, without the
denominators the ground state of $H_s$ would correspond to a
single big cluster.
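
The role of the denominators can be checked on a toy example. The Python sketch below evaluates $H_s$ with and without them; the switch `with_denominators`, the choice $F(d) = e^{-d}$ and the exclusion of self-pairs ($i = j$) are assumptions of this illustration, and labels run from 0 to $q-1$.

```python
import math


def similarity_hamiltonian(labels, s, q, with_denominators=True):
    """H_s of Eq. (8); optionally drop the cluster-size denominators."""
    total = 0.0
    for alpha in range(q):
        members = [i for i, lab in enumerate(labels) if lab == alpha]
        if not members:
            continue  # empty cluster: no contribution
        # Assumption: self-pairs are excluded from the double sum.
        within = sum(s[i][j] for i in members for j in members if i != j)
        if with_denominators:
            within /= len(members)
        total -= within  # ferromagnetic: similar pairs lower the energy
    return total


# Hypothetical example: two well-separated pairs of points on a line.
points = [0.0, 1.0, 10.0, 11.0]
s = [[math.exp(-abs(u - v)) for v in points] for u in points]  # assumed F

split, single = [0, 0, 1, 1], [0, 0, 0, 0]
split_norm = similarity_hamiltonian(split, s, 2)
single_norm = similarity_hamiltonian(single, s, 2)
split_raw = similarity_hamiltonian(split, s, 2, with_denominators=False)
single_raw = similarity_hamiltonian(single, s, 2, with_denominators=False)
print(split_norm < single_norm)  # with denominators the two-cluster split wins
print(single_raw < split_raw)    # without them a single big cluster wins
```

Without the denominators every positive coupling lowers the energy, so merging all points is always favourable; dividing by the cluster size removes that bias.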

The form of the function $F$, determining the relation between
$s$ and $d$, has to be specified. In what follows we consider two
forms of this relation: a scale-free relation

$$ s_{ij} = F_\gamma(d_{ij}) = \left( \frac{d_{ij}}{\langle d \rangle} \right)^{-\gamma} , \qquad (9) $$

depending on the exponent $\gamma$, and a scale-dependent
relation

$$ s_{ij} = F_a(d_{ij}) , \qquad (10) $$

dependent on the scale $a$. In the formulas above,
$\langle d \rangle$ is the average dissimilarity over all the
pairs of data-set points. The exponent $\gamma$ will be
restricted to assume small values so as to characterize the
corresponding Potts model by long-range ferromagnetic couplings;
the scale parameter $a$ will be bounded in $[0, 1]$.
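
A possible way to build the similarity matrix from the dissimilarities is sketched below in Python. The scale-free relation is taken here as the power law $(d_{ij}/\langle d \rangle)^{-\gamma}$; the Gaussian form used for the scale-dependent relation, with width $a \langle d \rangle$, is purely an assumption of this sketch and should not be read as the definition of $F_a$. Self-similarities are set to zero, which is a further assumption.

```python
import math


def mean_dissimilarity(d):
    """Average of d[i][j] over all pairs i < j."""
    n = len(d)
    pairs = [d[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum(pairs) / len(pairs)


def scale_free_similarity(d, gamma):
    """Power law in d_ij / <d> with exponent gamma (requires d_ij > 0 for i != j)."""
    mean_d = mean_dissimilarity(d)
    n = len(d)
    return [[0.0 if i == j else (d[i][j] / mean_d) ** (-gamma)
             for j in range(n)] for i in range(n)]


def scale_dependent_similarity(d, a):
    """ASSUMED Gaussian kernel exp(-d_ij^2 / (2 (a <d>)^2)), with a > 0."""
    width = a * mean_dissimilarity(d)
    n = len(d)
    return [[0.0 if i == j else math.exp(-d[i][j] ** 2 / (2.0 * width ** 2))
             for j in range(n)] for i in range(n)]


# Hypothetical example: three points on a line, so that <d> = 2.
points = [0.0, 1.0, 3.0]
d = [[abs(u - v) for v in points] for u in points]
s_free = scale_free_similarity(d, gamma=1.0)
s_scale = scale_dependent_similarity(d, a=0.5)
print(s_free[0][1], s_free[1][2])  # closer pairs receive larger similarity
```

Both functions are strictly decreasing in $d_{ij}$, as required of $F$; they differ in that the power law has no intrinsic length, while the assumed Gaussian suppresses couplings beyond the scale $a \langle d \rangle$.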

At this point it is worth stressing that minimization of the
distortion and maximization of the similarity yield, in the
autoencoder framework, different cost functions. The Hamiltonian
$H_d$ embodies the requirement that pairs of distant points
(large $d_{ij}$) should belong to different clusters. On the
other hand, the Hamiltonian $H_s$, for reasonable choices of $F$,
concentrates on pairs of close points (small $d_{ij}$) and forces
them to belong to the same cluster. In other words, $H_s$ may be
seen to implement the idea that clusters should be searched for
as dense connected regions in the data space.

We now describe the application of the variational criteria for
clustering, described above, to some artificial and real data
sets. We consider two optimization algorithms to find the
configuration of minimum cost: simulated annealing [?] and
mean-field annealing [?]. Both approaches associate a Gibbs
probability distribution to the functional to be optimized.
Simulated annealing is a Monte Carlo technique which samples the
Gibbs distribution as the temperature is reduced to zero, while
mean-field annealing attempts to track an approximation to the
mean of the distribution, known as the mean-field approximation
[19]. We remark that an efficient mean-field annealing algorithm
for the cost function (7), based on the EM scheme [?], is
described in [?]; the generalization of that algorithm to (8) is
straightforward.
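
A minimal simulated-annealing sketch in Python is given below, using single-label Metropolis updates and a geometric cooling schedule. The schedule, the parameter defaults and all function names are assumptions chosen for illustration and are not the settings used for the results reported here. Passing the matrix $d$ as `couplings` minimizes $H_d$; passing $-s$ minimizes $H_s$.

```python
import math
import random


def cluster_cost(labels, couplings, q):
    """Sum over clusters of (sum of couplings within the cluster) / size."""
    total = 0.0
    for alpha in range(q):
        members = [i for i, lab in enumerate(labels) if lab == alpha]
        if members:
            within = sum(couplings[i][j] for i in members for j in members)
            total += within / len(members)
    return total


def simulated_annealing(couplings, q, t_start=1.0, t_end=1e-3,
                        cooling=0.9, sweeps=20, seed=0):
    """Metropolis sampling of exp(-H/T) while T is lowered geometrically."""
    rng = random.Random(seed)
    n = len(couplings)
    labels = [rng.randrange(q) for _ in range(n)]
    energy = cluster_cost(labels, couplings, q)
    best_labels, best_energy = list(labels), energy
    temperature = t_start
    while temperature > t_end:
        for _ in range(sweeps * n):
            i = rng.randrange(n)
            old = labels[i]
            labels[i] = rng.randrange(q)  # propose a new label for point i
            new_energy = cluster_cost(labels, couplings, q)
            delta = new_energy - energy
            if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
                energy = new_energy       # accept the move
                if energy < best_energy:
                    best_labels, best_energy = list(labels), energy
            else:
                labels[i] = old           # reject and restore
        temperature *= cooling
    return best_labels, best_energy


# Hypothetical example: two tight groups of four points on a line.
points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
d = [[abs(u - v) for v in points] for u in points]
labels, energy = simulated_annealing(d, q=2)
print(labels, energy)
```

Recomputing the full cost at every proposal keeps the sketch short but costs $O(N^2)$ per move; a practical implementation would update only the two clusters affected by the move.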

FIG. 1. (a) An artificial data set made of two Gaussian clusters,
each consisting of 100 points. Empty squares and black circles
refer to the two different clusters. (b) Clustering result
obtained by minimization of $H_d$ (see the text).

In many cases the cost functions $H_d$ and $H_s$ have very close
global minima. For example, in Fig. 1a we depict an artificial
data set generated by two overlapping isotropic Gaussian
distributions. In this case the natural measure of dissimilarity
is the Euclidean metric, and we use $q = 2$. In Fig. 1b the
corresponding ground state of $H_d$ [?] is depicted: it is very
close to the Bayesian solution, i.e. the solution obtained by
drawing the symmetry plane between the centers of the two
Gaussians. A similar partition is obtained by minimizing $H_s$ by
simulated annealing. As a measure of the difference between two
partitions $\{\sigma_i\}$ and $\{\eta_i\}$, we evaluate the
following quantity:

$$ \epsilon = \frac{2}{N(N-1)} \sum_{i<j} \left| \delta_{\sigma_i \sigma_j} - \delta_{\eta_i \eta_j} \right| , $$

which counts, as a fraction of all pairs, the pairs of points
upon which the two partitions disagree. Using the scale-dependent
$F_a$, we find the ground state of $H_s$ to differ from that of
$H_d$ by $\epsilon < 0.01$ for $a$ varying in $[0.05, 1]$.
Analogously, using the scale-free $F_\gamma$ with
$\gamma \in [0.1, 1.5]$, we find $\epsilon < 0.02$ when we compare
the ground state of $H_s$ with that of $H_d$. Hence, on this data
set, the cost functions introduced above work similarly within
wide ranges of $\gamma$ and $a$ values.
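
The comparison between two partitions can be sketched in Python as follows. The normalization by the total number of pairs, $N(N-1)/2$, is an assumption of this sketch, chosen so that the result lies in $[0, 1]$; the function name is illustrative.

```python
def partition_disagreement(sigma, eta):
    """Fraction of pairs (i, j) grouped together in one partition but not in the other."""
    n = len(sigma)
    disagreements = 0
    for i in range(n):
        for j in range(i + 1, n):
            together_sigma = sigma[i] == sigma[j]
            together_eta = eta[i] == eta[j]
            if together_sigma != together_eta:
                disagreements += 1
    return 2.0 * disagreements / (n * (n - 1))


# Relabelling the clusters does not change the partition.
same = partition_disagreement([0, 0, 1, 1], [1, 1, 0, 0])
# Moving one point changes 3 of the 6 pairs.
moved = partition_disagreement([0, 0, 1, 1], [0, 1, 1, 1])
print(same, moved)  # 0.0 0.5
```

Because only co-membership of pairs is compared, the measure is invariant under any permutation of the cluster labels.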

We find a similar behaviour with respect to the famous IRIS data
of Anderson [22]. This data set has often been used as a standard
for testing clustering algorithms: it consists of three clusters
(Virginica, Versicolor and Setosa), and there are 50 objects in
$R^4$ per cluster. Two clusters (Virginica, Versicolor) overlap
strongly. The clustering result, with $q = 3$ and minimizing
$H_d$, consists of three clusters of 61, 39 and 50 points
respectively, with 90% of the points correctly classified. We
obtain exactly the same partition by minimizing $H_s$ using a
scale-free $F$ (with $\gamma \in [0.15, 1.45]$), and using a
scale-dependent $F$ (with $a \in [0.25, 1]$). For
$a \in [0.1, 0.25]$ we obtain, in the scale-dependent case, a
slightly different partition with cluster sizes 58, 42, 50 and
93.3% of the points correctly classified. These results show
that, also in the IRIS case, the pairwise clustering procedures
by distortion minimization and similarity maximization are almost
equivalent.
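
Since cluster labels are arbitrary, a percentage of correctly classified points requires matching clusters to the known classes. One common convention, assumed in the Python sketch below, is to take the label permutation that maximizes the agreement; this is an illustration and not necessarily the procedure behind the percentages quoted above. The exhaustive search over permutations is practical only for small $q$.

```python
from itertools import permutations


def classification_efficiency(true_labels, cluster_labels, q):
    """Percentage of points correctly classified under the best label matching."""
    best = 0
    for perm in permutations(range(q)):
        # perm[c] is the class assigned to cluster c under this matching
        hits = sum(1 for t, c in zip(true_labels, cluster_labels) if perm[c] == t)
        best = max(best, hits)
    return 100.0 * best / len(true_labels)


# Hypothetical example: six points, three classes, one point misassigned.
true_labels = [0, 0, 1, 1, 2, 2]
cluster_labels = [2, 2, 0, 0, 1, 0]
efficiency = classification_efficiency(true_labels, cluster_labels, q=3)
print(efficiency)  # 5 of 6 points agree under the best matching
```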

FIG. 2. (a) An artificial data set made of an elongated cluster
of 500 points (empty circles) and a circular cluster of 200
points (black circles). (b) Partition by minimizing $H_d$.
(c) Partition by minimizing scale-dependent $H_s$ with $a < 0.7$.
(d) Partition by minimizing scale-dependent $H_s$ with $a > 0.7$.

A typical situation resulting in different answers from $H_d$ and
$H_s$ is depicted in Fig. 2a. This two-dimensional data set is
made of an elongated cluster and a Gaussian-distributed circular
one. It is evident that two dense connected regions are present,
and that the farthest pairs of points belong to the same
connected region. This is the type of data set for which
minimizing the distortion is not equivalent to maximizing the
similarity. In Fig. 2b the partition we obtain by minimizing
$H_d$ is depicted: it fails to recognize the structure in the
data set. Let us now consider the ground state of $H_s$ with the
scale-dependent $F$. For $a < 0.7$ the ground state, depicted in
Fig. 2c, recognizes the data structure with 99% accuracy. At
$a \approx 0.7$ a transition phenomenon occurs: the configuration
depicted in Fig. 2c ceases to be the global minimum, the new
ground state (Fig. 2d) being very close to the solution by $H_d$.

FIG. 3. (a) The efficiency (percentage of correctly classified
points) versus $a$, obtained on the data set depicted in Fig. 2
by minimizing $H_s$ with scale-dependent $F$. The dashed line is
the efficiency obtained by minimization of $H_d$. (b) The
$\epsilon$ parameter (see the text) between partitions
corresponding to adjacent values of $a$, plotted versus $a$.
(c) The size of the two output clusters versus $a$.

In Fig. 3a we depict the efficiency of the classification versus
the resolution parameter $a$, for the scale-dependent $F$, while
in Fig. 3b we consider a sequence of $a$ values and we plot the
$\epsilon$ between partitions corresponding to adjacent values of
$a$. The peak at $a = 0.7$ is the indicator of the transition
between global minima. Finally, in Fig. 3c the size of the two
clusters, versus $a$, is depicted. Concerning the scale-free $F$,
in Fig. 4 the same plots as in Fig. 3 are depicted, showing that
the good minimum is stable for a
wide range of $\gamma$. The choice of the optimization algorithm
deserves a comment. All the results described above are obtained
by simulated annealing; we also apply the mean-field annealing
scheme described in [?], and we always find a configuration very
close to the one from simulated annealing, while spending less
computational time. This confirms that optimization algorithms
rooted in mean-field theory quickly yield a good solution on
these problems [?].

FIG. 4. (a) The efficiency (percentage of correctly classified
points) versus $\gamma$, obtained on the data set depicted in
Fig. 2 by minimizing $H_s$ with scale-free $F$. The dashed line
is the efficiency obtained by minimization of $H_d$. (b) The
$\epsilon$ parameter (see the text) between partitions
corresponding to adjacent values of $\gamma$, plotted versus
$\gamma$. (c) The size of the two output clusters versus
$\gamma$.

In summary, we address non-hierarchical pairwise clustering and,
working in the probabilistic autoencoder framework, we introduce
a class of cost functions arising from the requirement of maximal
average similarity between the input and the output of the
autoencoder. Our simulations show that the partition provided by
these new cost functions corresponds to extracting dense
connected regions in data space, and that a relevant discrepancy
with the partition provided by the cost function introduced in
[?] is to be expected in the case of non-trivial cluster
geometry. We note that the approach to clustering described here
has some similarities with the method in [?]: indeed, in both
cases clustering is mapped onto a ferromagnetic Potts model with
couplings decreasing with the distance. In the superparamagnetic
approach, however, $q$ is not related to the number of classes
present in the data set, and one obtains hierarchical clustering
as the temperature of the Potts model is varied. In the present
case $q$ is the number of classes, which is supposed to be known
(non-hierarchical clustering), and the denominators in the
Hamiltonian, ensuring the coherence of the clusters, lead to a
non-trivial ground state which reflects the data structure. We
consider two classes of cost functions. Scale-free cost functions
depend on the exponent $\gamma$, while scale-dependent ones
depend on the scale parameter $a$. Varying $a$, i.e. changing the
resolution at which the data set is processed, may give rise to
transitions between different partitions; in the scale-free case,
the clustering output is fairly stable with respect to $\gamma$
over a wide range.

Further work will be devoted to testing these new cost functions
on other real applications and to studying related issues, such
as the introduction of an adaptive relation between distortion
and similarity, i.e. the function $s = F(d)$ might depend on the
properties of the data set in a neighbourhood of the pair of
points under consideration. It will also be important to develop
cluster-validity criteria to provide a means of choosing an
optimal $q$ value in situations where the number of classes is
ambiguous.